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Abstract — Compressed sensing aims to undersample certain high- 
dimensional signals, yet accurately reconstruct them by exploiting signal 
characteristics. Accurate reconstruction is possible when the object to 
be recovered is sufficiently sparse in a known basis. Currently, the best 
known sparsity-undersampling tradeoff is achieved when reconstructing 
by convex optimization - which is expensive in important large-scale 
applications. 

Fast iterative thresholding algorithms have been intensively studied 
as alternatives to convex optimization for large-scale problems. Un- 
fortunately known fast algorithms offer substantially worse sparsity- 
undersampling tradeoffs than convex optimization. 

We introduce a simple costless modification to iterative thresholding 
making the sparsity-undersampling tradeoff of the new algorithms equiv- 
alent to that of the corresponding convex optimization procedures. The 
new iterative-thresholding algorithms are inspired by belief propagation 
in graphical models. 

Our empirical measurements of the sparsity-undersampling tradeoff 
for the new algorithms agree with theoretical calculations. We show 
that a state evolution formalism correctly derives the true sparsity- 
undersampling tradeoff. There is a surprising agreement between earlier 
calculations based on random convex polytopes and this new, apparently 
very different theoretical formalism. 

I. Introduction and overview 

Compressed sensing refers to a growing body of techniques that 
'undersample' high-dimensional signals and yet recover them accu- 
rately |T), 0- Such techniques make fewer measurements than tra- 
ditional sampling theory demands: rather than sampling proportional 
to frequency bandwidth, they make only as many measurements as 
the underlying 'information content' of those signals. However, as 
compared with traditional sampling theory, which can recover signals 
by applying simple linear reconstruction formulas, the task of signal 
recovery from reduced measurements requires nonlinear, and so far, 
relatively expensive reconstruction schemes. One popular class of 
reconstruction schemes uses linear programming (LP) methods; there 
is an elegant theory for such schemes promising large improvements 
over ordinary sampling rules in recovering sparse signals. However, 
solving the required LPs is substantially more expensive in applica- 
tions than the linear reconstruction schemes that are now standard. In 
certain imaging problems, the signal to be acquired may be an image 
with 10 6 pixels and the required LP would involve tens of thousands 
of constraints and millions of variables. Despite advances in the speed 
of LP, such problems are still dramatically more expensive to solve 
than we would like. 

This paper develops an iterative algorithm achieving reconstruction 
performance in one important sense identical to LP-based reconstruc- 
tion while running dramatically faster. We assume that a vector yofn 
measurements is obtained from an unknown iV-vector xo according 
to y = Axo, where A is the n x N measurement matrix n < N. 
Starting from an initial guess x = 0, the first order approximate 
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message passing (AMP) algorithm proceeds iteratively according to: 

x t+1 = t^fAV+l'), (1) 

t A t . 1 t 

z = y — Ax + —z 
o 

are scalar threshold functions (applied componentwise), 
x" t IK" is the current estimate of xo, and z l £ R" is the 
current residual. A* denotes transpose of A. For a vector u = 
(u(l), . . . , u(N)), (u) = Eti Finally V ' t ( s ) = s). 

Iterative thresholding algorithms of other types have been popular 
among researchers for some years, the focus being on schemes of 
the form 
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Such schemes can have very low per-iteration cost and low storage 
requirements; they can attack very large scale applications, - much 
larger than standard LP solvers can attack. However, 0-0 fall short 
of the sparsity-undersampling tradeoff offered by LP reconstruction 


Iterative thresholding schemes based on (3), lack the crucial 
term in (2) - namely, f z' -1 {n't (A* z l ~ x + a-' -1 )) is not included. 
We derive this term from the theory of belief propagation in graph- 
ical models, and show that it substantially improves the sparsity- 
undersampling tradeoff. 

Extensive numerical and Monte Carlo work reported here shows 
that AMP, defined by eqns |T), achieves a sparsity-undersampling 
tradeoff matching the theoretical tradeoff which has been proved 
for LP-based reconstruction. We consider a parameter space with 
axes quantifying sparsity and undersampling. In the limit of large 
dimensions N, n, the parameter space splits in two phases: one where 
the MP approach is successful in accurately reconstructing xo and 
one where it is unsuccessful. References (4), 0, HJ derived regions 
of success and failure for LP-based recovery. We find these two os- 
tensibly different partitions of the sparsity-undersampling parameter 
space to be identical. Both reconstruction approaches succeed or fail 
over the same regions, see Figure [T] 

Our finding has extensive empirical evidence and strong theoretical 
support. We introduce a state evolution formalism and find that it 
accurately predicts the dynamical behavior of numerous observables 
of the AMP algorithm. In this formalism, the mean squared error 
of reconstruction is a state variable; its change from iteration to 
iteration is modeled by a simple scalar function, the MSE map. When 
this map has nonzero fixed points, the formalism predicts that AMP 
will not successfully recover the desired solution. The MSE map 
depends on the underlying sparsity and undersampling ratios, and can 
develop nonzero fixed points over a region of sparsity/undersampling 



space. The region is evaluated analytically and found to coincide very 
precisely (ie. within numerical precision) with the region over which 
LP-based methods are proved to fail. Extensive Monte Carlo testing 
of AMP reconstruction finds the region where AMP fails is, to within 
statistical precision, the same region. 

In short we introduce a fast iterative algorithm which is found to 
perform as well as corresponding linear programming based methods 
on random problems. Our findings are supported from simulations 
and from a theoretical formalism. 

Remarkably, the success/failure phases of LP reconstruction were 
previously found by methods in combinatorial geometry; we give 
here what amounts to a very simple formula for the phase boundary, 
derived using a very different and seemingly elegant theoretical 
principle. 

A. Underdetermined Linear Systems 

Let xo £ R N be the signal of interest. We are interested in 
reconstructing it from the vector of measurements y = Axo, with 
y £ R n , for n < N. For the moment, we assume the entries Aij of 
the measurement matrix are independent and identically distributed 
normal N(0, 1/n). 

We consider three canonical models for the signal xo and three 
nonlinear reconstruction procedures based on linear programming. 

+: xq is nonnegative, with at most k entries different from 0. 
Reconstruct by solving the LP: minimize Yli=i Xi subject to x > 0, 
and Ax = y. 

±: xo has as many as k nonzero entries. Reconstruct by solving the 
minimum l\ norm problem: minimize ||a;||i, subject to Ax = y. This 
can be cast as an LP. 

□ : xo £ [—1, 1]^, with at most k entries in the interior (—1, 1). 
Reconstruction by solving the LP feasibility problem: find any vector 
x e [-1, +1]^ with Ax = y. 

Despite the fact that the systems are underdetermined, under certain 
conditions on k,n,N these procedures perfectly recover xo. This 
takes place subject to a sparsity-undersampling tradeoff namely an 
upper bound on the signal complexity k relative to n and N. 

B. Phase Transitions 

The sparsity-undersampling tradeoff can most easily be described 
by taking a large-system limit. In that limit, we fix parameters (S, p) 
in (0, l) 2 and let k,n,N — > oo with k/n — > p and n/N — > 5. 
The sparsity-undersampling behavior we study is controlled by (S, p), 
with S the undersampling fraction and p a measure of sparsity (with 
larger p corresponding to more complex signals). 

The domain (5, p) £ (0, l) 2 has two phases, a 'success' phase, 
where exact reconstruction typically occurs, and a 'failure' phase 
were exact reconstruction typically fails. More formally, for each 
choice of x 6 { + , ±, □} there is a function pcc(s x) whose graph 
partitions the domain into two regions. In the 'upper' region, where 
p > pcc(<5;x)> the corresponding LP reconstruction x\(x) fails to 
recover xo, in the following sense: as k, n, N — -> oo in the large 
system limit with k/n — > p and n/N — > S, the probability of exact 
reconstruction {xi(x) — x o} tends to zero exponentially fast. In the 
'lower' region, where p < pca{5;x)> LP reconstruction succeeds to 
recover xo, in the following sense: as k,n,N — > oo in the large 
system limit with k/n — > p and n/N — > S, the probability of exact 
reconstruction {xi(x) = ^o} tends to one exponentially fast. We 
refer to |4), (5), (7), (6) for proofs and precise definitions of the 
curves p CG (-; x)- 




Fig. 1. The phase transition lines for reconstructing sparse non-negative 
vectors (problem +, red), sparse signed vectors (problem ±, blue) and vectors 
with entries in [—1, 1] (problem □, green). Continuous lines refer to analytical 
predictions from combinatorial geometry or the state evolution formalisms. 
Dashed lines present data from experiments with the AMP algorithm, with 
signal length N = 1000 and T = 1000 iterations. For each value of S, we 
considered a grid of p values, at each value, generating 50 random problems. 
The dashed line presents the estimated 50th percentile of the response curve. 
At that percentile, the root mean square error after T iterations obeys or < 
10 — 3 in half of the simulated reconstructions. 



The three functions p CG (-;+), p CG (-;±), p CG (-',0) are shown 
in Figure 1; they are the red, blue, and green curves, respectively. 
The ordering p C c(<5; +) > Pcg(S; ±) (red > blue) says that knowing 
that a signal is sparse and positive is more valuable than only 
knowing it is sparse. Both the red and blue curves behave as 
Pcg(<5; +, ±) ~ (2 log(l/5)) _1 as S — > 0; surprisingly large amounts 
of undersampling are possible, if sufficient sparsity is present. In 
contrast, p C c(S;0) — (green curve) for S < 1/2 so the bounds 
[—1, 1] are really of no help unless we use a limited amount of 
undersampling, i.e. by less than a factor of two. 

Explicit expressions for p C a(S;+,±) are given in |4), (5); they 
are quite involved and use methods from combinatorial geometry. 
By Finding 1 below, they agree to within numerical precision to the 
following formula: 



Pse(S; x) = max 

2>0 



l-( Kx /5)[(l + z 2 M-z)-z^(z)] 

1+Z 2 -K X [(1 + Z 2 )<$>(-Z) - Z<t>(z)] 



(5) 



principal result of this paper, uses methods unrelated to combinatorial 
geometry. 

C. Iterative Approaches 

Mathematical results for the large-system limit correspond well 
to application needs. Realistic modern problems in spectroscopy 
and medical imaging demand reconstructions of objects with tens 
of thousands or even millions of unknowns. Extensive testing of 
practical convex optimizers in these problems (8) has shown that the 
large system asymptotic accurately describes the observed behavior 
of computed solutions to the above LPs. But the same testing shows 
that existing convex optimization algorithms run slowly on these large 
problems, taking minutes or even hours on the largest problems of 
interest. 

Many researchers have abandoned formal convex optimization, 
turning to fast iterative methods instead (9), 1101 . 1111 . 



The iteration Q~)-(2) is very attractive because it does not require 
the solution of a system of linear equations, and because it does 
not require explicit operations on the matrix A; it only requires 
that one apply the operators A and A* to any given vector. In a 
number of applications - for example Magnetic Resonance Imaging 
- the operators A which make practical sense are not really Gaussian 
random matrices, but rather random sections of the Fourier transform 
and other physically-inspired transforms [2), 1121 . Such operators can 
be applied very rapidly using FFTs, rendering the above iteration 
extremely fast. Provided the process stops after a limited number of 
iterations, the computations are very practical. 

The thresholding functions {?7t (•)}*>(> m these schemes depend 
on both iteration and problem setting. In this paper we consider 
rjt{ ■ ) = rj{-\ Aat, x)> where A is a threshold control parameter, \ 
{+, ±, □} denotes the setting, and of = Ave i E{(s i (j) - x (j)) 2 } 
is the mean square error of the current current estimate x l (in practice 
an empirical estimate of this quantity is used). 

For instance, in the case of sparse signed vectors (i.e. problem 
setting ±), we apply soft thresholding r\t{u) = rj(u; Act, ±), where 



{(u — Act) if u > Act, 
(u + Act) if it < —Act, 
otherwise, 



(6) 



where we dropped the argument ± to lighten notation. Notice that 
r) t depends on the iteration number t only through the mean square 
error (MSE) of. 

D. Heuristics for Iterative Approaches 

Why should the iterative approach work, i.e. why should it 
converge to the correct answer xo? The case ± has been most 
discussed and we focus on that case for this section. Imagine first 
of all that A is an orthogonal matrix, in particular A* = A~ . 
Then the iteration JT] - J2] stops in 1 step, correctly finding xq. Next, 
imagine that A is an invertible matrix; 1131 , has shown that a related 
thresholding algorithm with clever scaling of A* and clever choice of 
threshold, will correctly find xq . Of course both of these motivational 
observations assume n = TV, so we are not really imtfersampling. 

We sketch a motivational argument for thresholding in the truly 
undersampled case n < N which is statistical, which has been 
popular with engineers 1121 and which leads to a proper 'psychology' 
for understanding our results. Consider the operator H = A* A — I, 
and note that A*y = xo + Hxq. If A were orthogonal, we would 
of course have H = 0, and the iteration would, as we have seen 
immediately succeed in one step. If A is a Gaussian random matrix 
and n < N, then of course A is not invertible and A* is not A -1 . 
Instead of Hxo = 0, in the undersampled case Hxo behaves as a 
kind of noisy random vector, i.e. A*y = xo + noise. Now xo is 
supposed to be a sparse vector, and, one can see, the noise term 
is accurately modeled as a vector with i.i.d. Gaussian entries with 
variance 1 1| rco || i - 

In short, the first iteration gives us a 'noisy' version of the sparse 
vector we are seeking to recover. The problem of recovering a sparse 
vector from noisy measurements has been heavily discussed 1141 and 
it is well understood that soft thresholding can produce a reduction 
in mean-squared error when sufficient sparsity is present and the 
threshold is chosen appropriately. Consequently, one anticipates that 
x 1 will be closer to xo than A*y. 

At the second iteration, one has A*(y — Ax 1 ) = xo + H(xo — 
x 1 ). Naively, the matrix H does not correlate with xo or x , and 
so we might pretend that H(xq — x 1 ) is again a Gaussian vector 
whose entries have variance n -1 ||s;o — xl \\%. This 'noise level' is 




Fig. 2. Development of fixed points for formal MSE evolution. Here we plot 
>&(cr 2 ) - a 2 where • ) is the MSE map for \ = + (left column), \ = ± 
(center column) and \ = □ (right column), 5 = 0.1 (upper row,x G {+, ±}), 
8 = 0.55 (upper row,x = d), S = 0.4 (lower row,x € {+,±}) and 
<5 = 0.75 (lower row,x = D). A crossing of the y-axis corresponds to a 
fixed point of *B, If the graphed quantity is negative for positive a 2 , has 
no fixed points for a > 0. Different curves correspond to different values of 
p: where p is respectively less than, equal to and greater than p S E- In each 
case, has a stable fixed fixed point at zero for p < p S E, and no other 
fixed points, an unstable fixed point at zero for p = p S E and devlops two 
fixed points at p > p S E- Blue curves correspond to p = p S e(5;x)> green to 
p = 1.05 • Pse(<5; x)> red to p = 0.95 ■ Pse(<5; x)- 



smaller than at iteration zero, and so thresholding of this noise can 
be anticipated to produce an even more accurate result at iteration 
two; and so on. 

There is a valuable digital communications interpretation of this 
process. The vector w = Hx^ is the cross-channel interference 
or mutual access interference (MAI), i.e. the noiselike disturbance 
each coordinate of A*y experiences from the presence of all the 
other 'weakly interacting' coordinates. The thresholding iteration 
suppresses this interference in the sparse case by detecting the many 
'silent' channels and setting them a priori to zero, producing a 
putatively better guess at the next iteration. At that iteration, the 
remaining interference is proportional not to the size of the estimand, 
but instead to the estimation error, i.e. it is caused by the errors in 
reconstructing all the weakly interacting coordinates; these errors are 
only a fraction of the sizes of the estimands and so the error is 
significantly reduced at the next iteration. 

E. State Evolution 

The above 'sparse denoising'/'interference suppression' heuristic, 
does agree qualitatively with the actual behavior one can observe 
in sample reconstructions. It is very tempting to take it literally. 
Assuming it is literally true that the MAI is Gaussian and independent 
from iteration to iteration, we can can formally track the evolution, 
from iteration to iteration, of the mean-squared error. 

This gives a recursive equation for the formal MSE, i.e. the MSE 
which would be true if the heuristic were true. This takes the form 



CT t 2 +1 = *(<T t 2 ) . 



*(ct 2 ) =. E{ [r,(X+ ^=Z; Act) - X] 2 } 



(7) 
(8) 



Here expectation is with respect to independent random variables 
Z ~ N(0, 1) and X, whose distribution coincides with the empirical 



distribution of the entries of xo- We use soft thresholding l|6j if the 
signal is sparse and signed, i.e. if \ = ±. In the case of sparse non- 
negative vectors, \ — +> we W1 H l et v( u ~t ^°"> +) ~ max(u — Act, 0). 
Finally, for \ = □, we let f][u; □) = sign(tt) min(|M|, 1). Calcula- 
tions of this sort are familiar from the theory of soft thresholding of 
sparse signals; see the Supplement for details. 
We call * : o 2 h-> *(a 2 ) the MSE map. 

Definition LI. Given implicit parameters (x,S, p,\, F), with F — 
Fx the distribution of the random variable X. State Evolution is the 
recursive map (one-dimensional dynamical system): a 2 \— > ^{o 2 ). 

Implicit parameters (x, 5, p, A, F ) stay fixed during the evolution. 
Equivalently, the full state evolves by the rule 

(a 2 ; X ,8, p, \, F x ) ^ (<K(a 2 ); X ,S, PA, Fx) ■ 
Parameter space is partitioned into two regions: 

Region (I): *(a 2 ) < a 2 for all a 2 G (0,EX 2 ]. Here of ^ as 
t — > oo: the SE converges to zero. 

Region (II): The complement of Region (I). Here, the SE recursion 
does not evolve to a 2 — 0. 

The partitioning of parameter space induces a notion of sparsity 
threshold, the minimal sparsity guarantee needed to obtain conver- 
gence of the formal MSE: 

p SE (<5; x, K Fx) = sup {p : (5, p, A, Fx) £ Region (I)} . (9) 

The subscript se stands for State Evolution. Of course, p SE depends on 
the case x £ { + • ± - D}; it also seems to depend also on the signal 
distribution Fx', however, an essential simplification is provided by 

Proposition 1.2. For the three canonical problems x £ { + ,±,n}> 
any 5 £ [0, 1], and any random variable X with the prescribed spar- 
sity and bounded second moment, Pse(5; x, A, Fx) is independent of 
Fx- 

Independence from F allows us to write Pse(5; x, A) for the 
sparsity thresholds. The proof of this statement is sketched below, 
along with the derivation of a more explicit expression. Adopt the 
notation 

Pse{5;x) = supp SE (S;x, A). (10) 
\>o 

High precision numerical evaluations of such expression uncovers the 
following very suggestive 

Finding 1. For the three canonical problems x £ {+!±iC}> an d 
for any 6 £ (0, 1) 

Pse(S;x) = Pcc(<5;x) ■ (11) 

In short, the formal MSE evolves to zero exactly over the same 
region of (5, p) phase space as does the phase diagram for the 
corresponding convex optimization! 

F. Failure of standard iterative algorithms 

If we trusted that formal MSE truly describes the evolution of the 
iterative thresholding algorithm, Finding Q] would imply that iterative 
thresholding allows to undersample just as aggressively in solving 
underdetermined linear systems as the corresponding LR 

FindingQ]gives new reason to hope for a possibility that has already 
inspired many researchers over the last five years: the possibility of 
finding a very fast algorithm that replicates the behavior of convex 
optimization in settings +, ±, □. 



Unhappily the formal MSE calculation does not describe the 
behavior of iterative thresholding: 

1. State Evolution does not predict the observed properties of iterative 
thresholding algorithms. 

2. Iterative thresholding algorithms, even when optimally tuned, do 
not achieve the optimal phase diagram. 

In J3), two of the authors carried out an extensive empirical study 
of iterative thresholding algorithms. Even optimizing over the free 
parameter A and the nonlinearity r\ the phase transition was observed 
at significantly smaller values of p than those observed for LP-based 
algorithms. 

Numerical simulations also show very clearly that the MSE map 
does not describe the evolution of the actual MSE under iterative 
thresholding. The mathematical reason for this failure is quite simple. 
After the first iteration, the entries of x become strongly dependent, 
and State Evolution does not predict the moments of x l . 

G. Message Passing Algorithm 

The main surprise of this paper is that this failure is not the end of 
the story. We now consider a modification of iterative thresholding 
inspired by message passing algorithms for inference in graphical 
models 1161 . and graph-based error correcting codes 1171 . 1181 . 
These are iterative algorithms, whose basic variables ('messages') are 
associated to directed edges in a graph that encodes the structure of 
the statistical model. The relevant graph here is a complete bipartite 
graph over N nodes on one side ('variable nodes'), and n on the 
others ('measurement nodes'). Messages are updated according to 
the rules 

xlt\ = Vt( E A bi Z b-i) , (12) 
be[n]\a 

z a^i = Va ^ A a jXj^ a , (13) 

J'6[p]\i 

for each (i,a) G [N] x [n]. We will refer to this algorifhrrQ as to 
MP. 

MP has one important drawback with respect to iterative thresh- 
olding. Instead of updating N estimates, at each iterations we need 
to update Nn messages, thus increasing significantly the algorithm 
complexity. On the other hand, it is easy to see that the right-hand 
side of eqn 1121 depends weakly on the index a (only one out 
of n terms is excluded) and that the right-hand side of eqn 1121 
depends weakly on i. Neglecting altogether this dependence leads to 
the iterative thresholding equations (31. l4l. A more careful analysis 
of this dependence leads to corrections of order one in the high- 
dimensional limit. Such corrections are however fully captured by 
the last term on the right hand side of eqn |2), thus leading to the 
AMP algorithm. Statistical physicists would call this the 'Onsager 
reaction term'; see 1241 . 

H. State Evolution is Correct for MP 

Although AMP seems very similar to simple iterative thresholding 
(3)-(D, SE accurately describes its properties, but not those of the 
standard iteration. As a consequence of Finding [T] properly tuned 
versions of MP-based algorithms are asymptotically as powerful as 
LP reconstruction. 

'For earlier applications of MP to compressed sensing see 1191 . 1201 . 1211 . 
Relations between MP and LP were explored in a number of papers, see for 
instance 1221 . 1231 . albeit from a different perspective. 



Comparison of Different Algorithms 




Fig. 3. Observed phase transitions of reconstruction algorithms. Algorithms 
studied include iterative soft and hard thresholding, orthogonal matching 
pursuit, and related. Parameters of each algorithm are tuned to achieve the 
best possible phase transition (3J. Reconstructions signal length N = 1000. 
Iterative thresholding algorithms used T = 1000 iterations. Phase transition 
curve displays the value of p = k/n at which success rate is 50%. 



We have conducted extensive simulation experiments with AMP, 
and more limited experiments with MP, which is computationally 
more intensive (for details see the complementary material). These 
experiments show that the performance of the algorithms can be 
accurately modeled using the MSE map. Let's be more specific. 

According to SE, performance of the AMP algorithm is predicted 
by tracking the evolution of the formal MSE of via the recursion 
(7). Although this formalism is quite simple, it is accurate in the high 
dimensional limit. Corresponding to the formal quantities calculated 
by SE are the actual quantities, so of course to the formal MSE 



corresponds the true MSE N 



x —Xq 



Other quantities can be 



computed in terms of the state o~\ as well: for instance the true false 
alarm rate (N — : x (i) 7^ and xo(i) = 0} is predicted 

via the formal false alarm rate ¥{i lt (X + S~ 1/2 a t Z) / 0|X = 
0}. Analogously, the true missed-detection rate : x (i) = 

and xo(i) 7^ 0} is predicted by the formal missed-detection rate 
f{Vt(X + 5- 1/2 a t Z) = 0\X / 0}, and so on. 

Our experiments establish agreement of actual and formal quanti- 
ties. 

Finding 2. For the AMP algorithm, and large dimensions N, n, we 
obsen'e 

I. SE correctly predicts the evolution of numerous statistical prop- 
erties of x l -with the iteration number t. The MSE, the number of 
nonzeros in x , the number of false alarms, the number of missed 
detections, and several other measures all evolve in way that matches 
the state evolution formalism to within experimental accuracy. 

II. SE correctly predicts the success/failure to converge to the 
correct result. In particular, SE predicts no convergence when p > 
Pse(<5; X) ty> an d convergence if p < Psn{S; Xi X), This is indeed 
observed empirically. 

Analogous observations were made for MP. 

/. Optimizing the MP Phase Transition 

An inappropriately tuned version of MP/AMP will not perform 
well compared to other algorithms, for example LP-based recon- 



structions. However, SE provides a natural strategy to tune MP and 
AMP (i.e. to choose the free parameter A): simply use the value 
achieving the maximum in eqn 1101 . We denote this value by A x (<5), 
X £ {+, ±, □}, and refer to the resulting algorithms as to optimally 
tuned MP/AMP (or sometimes MP/AMP for short). They achieve the 
State Evolution phase transition: 

Pse(5;x) = Pse(<5;x, A x (<5)). 

An explicit characterization of X X (S), x £ {+,±} can be found in 
the next section. 

We summarize below the properties of optimally tuned AMP/MP 
within the SE formalism. 

Theorem 1.3. For S G [0,1], p < /5 S e(<5;x)> an d any associated 
random variable X, the formal MSE of optimally-tuned AMP/MP 
evolves to zero under SE. Viceversa, if p > /9 S e(<5;x)> the formal 
MSE does not evolve to zero. Further, for p < Pse(5; x)» there exists 
b = b(S, p) > with the following property. If a\ denotes the formal 
MSE after t SE steps, then, for all t > 



of < 0% exp( 



-bt). 



(14) 



II. Details About the MSE Mapping 



In this section, we sketch the proof of Proposition II. 21 the iterative 
threshold does not depend on the details of the signal distribution. 
Further, we show how to derive the explicit expression for Pse(S; x)> 
X £ {+, ±}, given in the introduction. 

A. Local Stability Bound 

The state evolution threshold p S e(<5;Xi^) ls me supremum of all 
p's such that the MSE map ^(o 2 ) lies below the a 2 line for all 
<7 2 > 0. Since *(0) = 0, for this to happen it must be true that the 
derivative of the MSE map at a 2 = smaller than or equal to 1. We 
are therefore led to define the following 'local stability' threshold: 



Pls(S;xA) = sup < p 



Ckr2 



< 1 



(15) 



The above argument implies that Pse(5; x, X) < Pls(S; x, X). 

Considering for instance x = +, we obtain the following expres- 
sion for the first derivative of $ 



d^2 



E$[^-(X 
a 



Act) 



Act) 



where 4>(z) is the standard Gaussian density at z and &(z) = 
floo 't'(z') dz' is the Gaussian distribution. 

Evaluating this expression as o 2 J, 0, we get the local stability 
threshold for x — + : 



Pls(<5;x>^) 



1-(k x /S)[(1 + z*)$(-z)-z<Kz)] 



1 + Z 2 -K X [(1 + Z 2 )$(-Z) - Z<j>{z)] 



i=xVs 



where n x is the same as in |5). Notice that Pls(<5; +, A) depends on 
the distribution of X only through its sparsity (i.e. it is independent 

of Fx). 

B. Tightness of the Bound and Optimal Tuning 

We argued that 2 _ < 1 is necessary for the MSE map to 

converge to 0. This condition turns out to be sufficient because the 
function a 2 1— > ^(f 2 ) is concave on R+. This indeed yields 



2 . d* 
Ut+1 ~ d^ 



(16) 



which implies exponential convergence to the correct solution 1141 . 
In particular we have 



(17) 



whence p SE (5;Xj^) is independent of Fx as claimed. 

To prove a 2 i— > ^(c 2 ) is concave, one proceeds by computing 
its second derivative. For instance, in the case \ — +. one needs to 
differentiate the expression given above for the first derivative. We 
omit details but point out two useful remark: (i) The contribution 
due to X = vanishes; (ii) Since a convex combination of concave 
functions is also concave, it is sufficient to consider the case in which 
X = x* deterministically. 

As a byproduct of this argument we obtain explicit expressions 
for the optimal tuning parameter, by maximizing the local stability 
threshold 

1 f l-( Kx /6)[(l + z 2 )$(-z)-z<Kz)] 

X+(o) — — = are max < -, -= 

W VS *>o \l + z 2 ~K x [(l + z 2 )$(-z)-z ( b(z)] 

Before applying this formula in practice, please read the important 
notice in Supplemental Information. 

III. Discussion 

A. Relation with Minimax Risk 

Let Tf 1 denote the class of probability distributions F supported 
on (—00,00) with ¥{X 7^ 0} < e, and let r/(x;X,±) denote the 
soft-threshold function (6j with threshold value A. The minimax risk 
1141 is defined as 



M ± (e)=inf sup E F {[rj(X + Z; \,±) - X] 2 } , 



(18) 



with A ± (e) the optimal A. The optimal SE phase transition and 
optimal SE threshold obey 

8 = M ± (pS), p = p SE (5;±). (19) 

An analogous relation holds between the positive case p$e(5; +), 
and the minimax threshold risk M + where F is constrained to be 
a distribution on [0,oo). Exploiting 1191 , Supporting Information 
proves that 



Pco(S) = p SE (<5)(l + o(l)), 



0. 



B. Other Message Passing Algorithms 

The nonlinearity r}{ • ) in AMP eqns Q, (2) might be chosen 
differently. For sufficiently regular such choices, the SE formalism 
might predict evolution of the MSE. One might hope to use SE to 
design 'better' threshold nonlinearities. 

The threshold functions used here are such that the MSE map 
a 2 1 — > ^(o~ 2 ) is monotone and concave. As a consequence, the phase 
transition line Pse(<5;x) f° r optimally tuned AMP is independent of 
the empirical distribution of the vector xq. State Evolution may be 
inaccurate without such properties. 

Where SE is accurate, it offers limited room for improvement 
over the results here. If p SE denotes a (hypothetical) phase transition 
derived by SE with any nonlinearity whatsoever, Supporting Infor- 
mation exploits 1191 to prove 



Pse(5;x) < p SE (<S;x)(l + o(l)), 



0. 



xe{+,±}. 



In the limit of high undersampling, the nonlinearities studied here 
offer essentially unimprovable SE phase transitions. Our reconstruc- 
tion experiments also suggest that other nonlinearities yield little 
improvement over thresholds used here. 



C. Universality 

The SE-derived phase transitions are not sensitive to the detailed 
distribution of coefficient amplitudes. Empirical results in Supporting 
Information find similar insensitivity of observed phase transitions for 
MP. 

Gaussianity of the measurement matrix A can be relaxed; Sup- 
porting Information finds that other random matrix ensembles exhibit 
comparable phase transitions. 

In applications, one often uses very large matrices A which are 
never explicitly represented, but only applied as operators; examples 
include randomly undersampled partial Fourier transforms. Support- 
ing Information finds that observed phase transitions for MP in the 
partial Fourier case are comparable to those for random A. 

Acknowledgements 

A. Montanari was partially supported by the NSF CAREER 
award CCF-0743978 and the NSF grant DMS-0806211, and thanks 
Microsoft Research New England for hospitality during completion of 
this work. A. Maleki was partially supported by NSF DMS-050530. 

References 

[I] D. L. Donoho, "Compressed Sensing," IEEE Transactions on Information 
Theory, Vol. 52, pp. 489-509, April 2006. 

[2] E. Candes, J. Romberg, T. Tao, "Robust uncertainty principles: Exact 
signal reconstruction from highly incomplete frequency information," 
IEEE Transactions on Information Theoty,Vol. 52, No. 2, pp. 489-509, 
February 2006. 

[3] A. Maleki, D. L. Donoho, "Optimally Tuned Iterative Thresholding 
Algorithms," submitted to IEEE journal on selected areas in signal 
processing, 2009. 

[4] D. L. Donoho, "High-Dimensional centrally symmetric polytopes with 
neighborliness proportional to dimension," Discrete and Computational 
Geometry, Vol 35,No. 4, pp. 617-652, 2006. 

[5] D. L. Donoho, J. Tanner, "Neighborliness of randomly-projected simplices 
in high dimensions," Proceedings of the National Academy of Sciences, 
Vol. 102, No. 27, p. 9452-9457, 2005. 

[6] D. L. Donoho, J. Tanner, "Counting faces of randomly projected hyper- 
cubes and orthants, with applications," ArXiv. 

[7] D. L. Donoho, J. Tanner, "Counting faces of randomly projected polytopes 
when the projection radically lowers dimension," /. Amer. Math. Soc. , Vol. 
22, pp. 1-53, 2009. 

[8] D. L. Donoho, J. Tanner, "Observed universality of phase transitions in 

high dimensional geometry, with implication for modern data analysis 

and signal processing," Phil. Trans. A, 2009. 
[9] K. K. Herrity, A. C. Gilbert, and J. A. Tropp, "Sparse approximation via 

iterative thresholding," Proc. ICASSP, Vol. 3, pp. 624-627, Toulouse, May 

2006. 

[10] J. A. Tropp, A. C. Gilbert," Signal recovery from random measurements 
via orthogonal matching pursuit," IEEE Transactions Information Theory, 
53(12),pp. 4655-4666, 2007. 

[II] P. Indyk, M. Ruzic, "Near optimal sparse recovery in the £1 norm," In 
49th Annual Symposium on Foundations of Computer Science, pp. 199- 
207, Philadelphia, PA, October 2008. 

[12] M. Lustig, D. L. Donoho, J. M. Santos, J. M. Pauly, "Compressed 
sensing MRI," IEEE Signal Processing Magazine, 2008. 

[13] I. Daubechies, M. Defrise and C. De Mol, "An iterative thresholding 
algorithm for linear inverse problems with a sparsity constraint," Com- 
munications on Pure and Applied Mathematics, Vol. 75, pp. 1412-1457, 
2004. 

[14] D. L. Donoho, I, M. Johnstone, "Minimax risk over i v balls," Prob. Th. 

and Rel. Fields, Vol. 99, pp. 277-303, 1994. 
[15] D. L. Donoho, I. M. Johnstone, "Ideal spatial adaptation via wavelet 

shrinkage," Biometrica, Vol. 81, pp. 425-455, 1994. 
[16] J. Pearl, Probabilistic reasoning in intelligent systems: networks of 

plausible inference, Morgan Kaufmann, San Francisco, 1988. 
[17] R. G. Gallager, Low-Density Parity-Check Codes, MIT 

Press, Cambridge, Massachusetts, 1963, Available online at: 

http : / / web . / gallager /www /pages /ldpc . pelf 



[18] T. J. Richardson, R. Urbanke, Modern coding theory, 
Cambridge University Press, Cambridge, 2008, Available online at: 
http : // It he www . epf 1 . ch/mct / index . php 

[19] Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar and A. Kabbani, 
"Counter Braids: a novel counter architecture for per-flow measurement," 
SIGMETRICS, Annapolis, June 2008. 

[20] S. Sarvotham, D. Baron and R. Baraniuk, "Compressed sensing recon- 
struction via belief propagation," Preprint, 2006. 

[21] F. Zhang, H. Pfister,"On the iterative decoding of high-rate LDPC codes 
with applications in compressed sensing," arXiv:0903.2232v2, 2009. 

[22] M. J. Wainwright, T. S. Jaakkola and A. S. Willsky,"MAP estimation 
via agreement on trees: message-passing and linear programming," IEEE 
Transactions on Information Theory, Vol. 51, No. 11, pp. 3697-3717. 

[23] M. Bayati, D. Shah and M. Sharma, " Max-Product for maximum weight 
matching: convergence, correctness, and LP duality," IEEE Transactions 
on Information Theory, Vol. 54, No. 3, pp. 1241-1251, 2008. 

[24] D. J. Thouless, P. W. Anderson and R. G. Palmer, " Solution of solvable 
model of a spin glass," Phil. Mag., Vol. 35, pp. 593-601, 1977. 

[25] D. L. Donoho and I.M. Johnstone and J.C. Hoch and A.S. Stern, 
"Maximum Entropy and the Nearly Black Object," Journal of the Royal 
Statistical Society, Series B (Methodological), Vol. 54, pp. 41-81, 1992. 

[26] D. L. Donoho, J. Tanner, "Phase transition as sparse sampling theorems," 
IEEE Transactions on Information 77ieory,submitted for publication. 

[27] P. J. Bickel, "Minimax estimation of the mean of a normal distribution 
subject to doing well at a point," in Recent Advances in Statistics: Papers 
in Honor of Herman Chernoff on His Sixtieth Birthday, Academic Press, 
511-528, 1983. 

[28] B. Efron and T. Hastie and I. Johnstone and R. Tibshirani, "Least angle 
regression," Annals of Statistics, Vol.32, pp. 407-492, 2004. 

Appendix 

A. Important Notice 

Readers familiar with the literature of thresholding of sparse signals 
will want to know that an implicit rescaling is needed to match 
equations from that literature with equations here. Specifically, in 
the traditional literature, one is used to seeing expressions r](x; Act) 
in cases where ct is the standard deviation of an underlying normal 
distribution. This means the threshold A is specified in standard 
deviations, so many people will immediately understand values like 
of A = 2, 3 etc in terms of their false alarm rates. In the main text, 
the expression rj(x; Act) appears numerous times, but note that a is 
not the standard deviation of the relevant normal distribution; instead, 
the standard deviation of that normal is r = a/VS. It follows that 
A in the main text is calibrated differently from the way A would be 
calibrated in other sources, differing by a 5-dependent scale factor. 
If we let \" s d E denote the quantity Xse appropriately rescaled so 
that it is in units of standard deviations of the underlying normal 
distribution, then the needed conversion to sd units is 



Asb = ^SE 



Vs. 



(20) 



B. A summary of notation 

The main paper will be referred as DMM throughout this note. 
All the notations are consistent with the notations used in DMM. We 
will use repeatedly the notation e = Sp. 

C. State Evolution Formulas 

In the main text we mentioned Pse(5; x> A, Fx) is independent of 
Fx- We also mentioned a few formulas for p SE (S;x)- The goal of 
this section is to explain the calculations involved in deriving these 
results. First, recall the expression for the MSE map 



*((7 ) =E 



{(v(x 



+ 7l z ' Xa ' x) 



X 



) 2 } 



(21) 



We denote by dirj and dir\ the partial derivatives of r] with respect 
to its first and second arguments. Using Stein's lemma and the fact 



that dfrj(x; y, x) = almost everywhere, we get 



Vs 



(22) 



where we dropped the dependence of r)( • ) on the constraint x to 
simplify the formula. 

1) Case x — +•' In this case we have X > almost surely and 
the threshold function is 



■n(x-\a) 



(x — Act) if x > Act, 
otherwise. 



As a consequence dir](x\ Act) = —d2rj(x; Act) = l(x > Act) (almost 
everywhere). This yields 

V~5, 



».(j + ».)„(^. to)) 



As ct | 0, we have <J>(^(AT-Act)) -> 1 and (f>(^-(X- Act) 
if X > 0. Therefore, 



dCT^ 



(1 - p5) <P(-\V5) . 



The local stability threshold p LS (5;+,X) is obtained by setting 



da 2 !o 



= 1. 



In order to prove the concavity of ct 2 h- > *I/(ct 2 ) first notice that 
a convex combination of concave functions is concave and so it is 
sufficient to show the concavity in the case X = x > determin- 
istically. Next notice that, in the case x = 0, 4-^ is independent of 
ct 2 . A a consequence, it is sufficient to prove H d r ^ < where 



d( CT 2)2 

■VS, , A , n ,{V5, 



5 P = (1 + A 2 «5) *(^(x- Act)) ~\VS^(x- Act)) . 
Using $'(it) = <f)(u) and <f)'( u ) = — Ul ^( u )> we g e t 

for x > 0. 

2) Case x = ±- Here X is supported on (— oo, oo) with ¥{X ^ 
0} < e = pS. Recall the definition of soft threshold 

il(x; Act) = 



(x — Act) if x > Act, 
(x + Act) if x < —Act, 
otherwise. 



As a consequence dir)(x; Act) = > Act) and d2ri(x;Xa) 

— sign(a;)I(|a;| > Act). This yields 



do2 



(I + A 2 )e{ $ (^(X-Act)) + 
V(* + Act))} 



a ^jVS 



{^ ( X-Act)) + ^ ( X + Act))} 



^7) 



By letting a j we get 



S = (i + >?)p8+(l+>?) (l-pS)29(-\VS) 



do 2 

which yields the local stability threshold Pls(<5; ±, A) by | = 1. 

Finally the proof of the concavity of a 2 i— > ^(a 2 ) is completely 
analogous to the case \ — +• 

3) Case \ = C ; Finally consider the case of X supported on 
[—1, +1] with P{A' ^ { + 1, —1}} < e. In this case we proposed the 
following nonlinearity, 

( +1 if x- > +1, 
rj{x) = < a; if -1 < x < +1, 
[ -1 if x < -1. 

Notice that the nonlinearity does not depend on any threshold 
parameter. Since dir](x) = I(a; £ [—1, +1]), 



As a | we get 



1 



whence the local stability condition jpr| < 1 yields p L s(<5;D) = 
(2~0+- 

Concavity of a 2 i— ^ ^{u 2 ) immediately follows from the fact that 
3>(— (1 — x)) is non-increasing in a for x < 1 and <&(— — (1 + a;)) 
is non-decreasing for a; > — 1. Using the combinatorial geometry 
result of f6) we get 

Theorem A.l. For any S e [0, 1], 

p co (5;0) = PsE (5;a) = p LS (S;D) = max {0, 2 - .T 1 } . (24) 

£>. Relation to Minimax Thresholding 

1 ) Minimax Thresholding Policy: We denote by the collection 
of all CDF's supported in [0, oo) and with F(0) > 1 — e, and by 
Tf 1 the collection of all CDF's supported in (—00,00) and with 
F(0+) - F(O-) > 1 - e. For x £ {+,±}> define the minimax 
threshold MSE 

M*(e;x) = mf sup E F { V (X + Z; A, x ) - A) 2 } , (25) 

where Ef denote expectation with respect to the random variable A 
with distribution F, and r)(x; A) = sign(a;)(|a-| — A)+ for \ — ± 
and rj(x;X) = (x — A)+ for x = +• Minimax Thresholding was 
discussed for the case \ = + m 1251 and for \ = i m H4I . 1151 . 

This machinery gives us a way to look at the results derived above 
in very commonsense terms. Suppose we know 8 and p but not 
the distribution F of A. Let's consider what threshold one might 
use, and ask at each given iteration of SE, the threshold which 
gives us the best possible control of the resulting formal MSE. That 
best possible threshold A* is by definition the minimax threshold at 
nonzero fraction e — pS, appropriately scaled by the effective noise 
level t — o/\/~8, 

X* = X* {p ■ 5;x) • a / VI, 

where x i + ii} depending on the case at hand. Note that this 
threshold does not depend on F. It depends on iteration only through 



the effective noise level at that iteration. The guarantee we then get for 
the formal MSE is the minimax threshold risk, appropriately scaled 
by the square of the effective noise level: 

'2 



MSE < M*(p6; X )-t* = M* (pS; X ) — : 



(26) 



for x £ {+,±j- This guarantee gives us a reduction in MSE over 
the previous iteration if and only if the right-hand side in Eq. J26b is 
smaller than o 2 , i.e. if and only if 

M*(p8;x)<S, XG{+,±}- 

In short, we can use state evolution with the minimax threshold, 
appropriately scaled by effective noise level, and we get a guaran- 
teed fractional reduction in MSE at each iteration, with fractional 
improvement 



umm(S,p; X ) = (1 - M*(pS;x)/8); 
hence the formal SE evolution is bounded by: 



a 2 < ll>mm(S, p; x)* ■ EA 2 



i=l,2, 



(27) 



(28) 



Results analogous to those of the main text hold for this minimax 
thresholding policy. That is, we can define a minimax thresholding 
phase transition such that below that transition, state evolution with 
minimax thresholding converges: 

Pmm(8; X ) = sup{p : M*(p5; X ) < 5}; x G {+,±}- 

Theorem A.2. Under SE with the minimax thresholding policy 
described above, for each (8,p) in (0, l) 2 obeying p < Pmm(8',x)> 
and for every marginal distribution F £ Tf, the formed MSE evolves 
to zero, with dynamics bounded by \27\ - \28\ . 

2) Relating Optimal Thresholding to Minimax Thresholding: An 
important difference between the optimal threshold defined in the 
main text and the minimax threshold is that A x = X x (5) depends 
only on the assumed 5 - no specific p need be chosen while minimax 
thresholding as defined above requires that one specify both S and 
p. However, since the methodology is seemingly pointless above the 
minimax phase transition, one might think to specify p = /9 M m(<5; x)- 
This new threshold A M m(<5;x) — X* (Spum(S); x) then requires no 
specification of p. As it turns out, the SE threshold coincides with 
this new threshold. 

Theorem A.3. For X G {+, ±} and S G [0, 1] 

M*(p5;x) = 8 if and only if p = p SE (S;x)- (29) 

Let A x (5) denote the minimax threshold defined in the main text, and 
let A x d (<5) denote denote the same quantity expressed in sd units \20\ . 
Then 



Xl d (5) = X*(p6), 



Pse(<5;x), 



XG {+,±} 



Proof: It is convenient to introduce the following explicit nota- 
tion for the MSE map: 



*(a 2 ; S, X, F) = E F { (r/( A + ^= Z; Xa) - A) 2 } 



(30) 



where Z ~ N(0, 1) is independent of A, and A ~ F. As above, 
we drop the dependency of the threshold function on x G {+,±} 
Since rj(ax; aX) = arj(x;X) for any positive a, we have the scale 
invariance 

<f(a 2 ; 5, X, F, X ) = 1, XV5, S sU2/a F), (31) 



where (S a F)(x) = F(x/a) is the operator that takes the CDF of an 
random variable X and returns the CDF of the random variable aX . 
Define 

1 2 

A<5,p;x) = jnf sup sup — ^(a ;5,\, F, X ) , (32) 

A>0 Fe:F x a 2 e(0,H. F {X 2 }] a 

where e = pS. It follows from the definition of se threshold that 
p < Pse(S;x) if an d onr y if J{$iP\x) < !• We first notice that by 
concavity of o 2 i— > ^(cr 2 ; 5, A, F, x), we have 

J{S,p;x) = inf sup sup — ^(cr 2 ; 5, A, F, x) (33) 

tr 2 >0 a 

= i inf sup sup *(1;1,A^,S, 1/2 ,F)(34) 
A Ferf <r 2 >Q 



5 A 



inf sup *(1;1,A,F) 



(35) 



where the second identity follows from the invariance property and 
the third from the observation that S a 3~f = for an y a > 0. 
Comparing with the definition l !25t . we finally obtain 



J(6,p; X ) = -M*(6p;x)- 



(36) 



Therefore p < p SE (5;x) if antl on ly 8 > M* (Sp; X ), which implies 
the thesis. ■ 

E. Convergence Rate of State Evolution 

The optimal thresholding policy described in the main text is the 
same as using the minimax thresholding policy but instead assuming 
the most pessimistic possible choice of p - the largest p that can 
possibly make sense. In contrast minimax thresholding is p-adaptive, 
and can use a smaller threshold where it would be valuable. Below the 
SE phase transition, both methods will converge, so what's different? 

Note that A SE (<5; x) antl ^mm(S, p; x) are dimensionally different; 
Ama/ is in standard deviation units. Converting A S e into sd units by 
l |20| l, we have Afg = A S e ■ <5 1//2 . Even after this calibration, we find 
that methods will generally use different thresholds, i.e. if p < p SE , 

Xmu{8,p;x) >~i£(5;x), xe{+,±}- 

In consequence, the methods may have different rates of convergence. 
Define the worst-case threshold MSE 

MSE(e, A;x) = sup E F {r)(X + Z; A) - X) 2 } 



and set 



M SE (S,p; X ) = MSE(Sp,\ s s £(5,x);x)- 



This is the MSE guarantee achieved by using Afg (<5) when in fact 
(8, p) is the case. Now by definition of minimax threshold MSE, 



M SE (8,p;x) > M*(5p- X ); 



(37) 



the inequality is generally strict. The convergence rate of optimal 
AMP under SE was described implicitly in the main text. We can 
give more precise information using this notation. Define 

u S e(5,p;x) = (1- M SE (8,p;x)/8); 

Then we have for the formal MSE of AMP 

o?<wte(J,p;x)*-]EX a , t= 1,2,3,... 

In the main text, the same relation was written in terms of exp(— bi), 
with b > 0; here we see that we may take b(5, p) = — log(a>sE(<5, p)). 



Explicit evaluation of this b requires evaluation of the worst-case 
thersholding risk MSE(e,A). Now by d37b we have 

use(S,p;x) > uuu{8,p;x), 

generally with strict inequality; so by using the p-adaptive threshold 
one gets better speed guarantees. 

F. Rigorous Asymptotic Agreement of SE and CG 

In this section we prove 
Theorem A.4. For x G {+, ±} 

Pcg{5\x) _ , 



lim 



(38) 



s^o pse(8;x) 

In words, Pca{8;x) is fh e phase transition computed by combi- 
natorial geometry (polytope theory) and Pse(S,x) obtained by state 
evolution: they are rigorously equivalent in the highly undersampled 
limit (i.e. 5^0 limit). In the main text, we only can make the 
observation that they agree numerically. 

1 ) Properties of the minimax threshold: We summarize here sev- 
eral known properties of the minimax threshold J25l >. which provide 
useful information about the behavior of SE. 

The extremal F achieving the supremum in Eq. ([25} is known. In 
the case x = +> it is a two-point mixture 

F+ = (l-e)S + e5 ll+(e) . (39) 

In the signed case x — ±> it is a three-point symmetric mixture 

(40) 



F, = (1- e)5 + - (<5 M ± (e) +5_ M ± (e) ). 

Precise asymptotic expressions for p x (e) are available. In particular, 

for X G {+,±}, 



(41) 



/i x (e) = v/21og(e)(l + o(l)) as e -> . 
We also know that 

M*(e; X ) =21og(e)(l + o(l)) as e -> . (42) 

2) Proof of Theorem \A.4\ Combining Theorem IA. 31 and Eg. ll42t, 

we get 

1 



Pse{8;p) 



21og(5)' 



0. 



(43) 



(correction terms that can be explicitly given). Now we know 
rigorously from [26) that the LP-based phase transitions satisfy a 
similar relationship: 



Theorem A.5 (Donoho and Tanner (26)). For x £ {+, ±} 

1 



Pcg(<5,x) 



0. 



21og(«5)' 

Combining now with Lemma 1431 we get Theorem IA.4I 



(44) 
□ 



G. Rigorous Asymptotic Optimality of Soft Thresholding 

The discussion in the main text, alluded to the possibility of im- 
proving on soft thresholding. Here we give a more formal discussion. 
We work in the situations X £ {+> ±}- Let rj denote some arbitrary 
nonlinearity with tuning parameter A. (For a concrete example, think 
of hard thresholding). We can define the minimax MSE for this 
nonlinearity in the natural way 

M(e; x) = inf sup E F {7j(X + Z; A) - X) 2 } , . (45) 

x Ferl 



there is a corresponding minimax threshold A(e;x)- We can deploy 
the minimax threshold in AMP by setting e — p8 and rescaling the 
threshold by the effective noise level r = cr/yS: 

actual threshold at iteration t = A(e; \) ' T 

= Hp$;x) ■ °t/Vs. 

Under state evolution, this is guaranteed to reduce the MSE provided 

M{pS; X ) < S. 

In that case we get full evolution to zero. It makes sense to define 
the minimax phase transition: 

p SE {S; X ) =sup{p : M(p5; X ) <S}; x£{+,±}- 

Whatever be F, for (8, p) with p < p SE (S), SE evolves the formal 
MSE of rj to zero. 

It is tempting to hope that some very special nonlinearity can do 
substantially better than soft thresholding. At least for the minimax 
phase transition, this is not so: 

Theorem A.6. Let p MM (<5;x) be a minimax phase transition com- 
puted under the State Evolution formalism for the cases x £ { + , ±) 
with some scalar nonlinearity rj. Let Pse(8; x) be the phase transition 
calculated in the main text for soft thresholding with corresponding 
optimal A. Then for x £ {+i i} 

lim MS,x) < L 

s^o Pse(<5;x) 

In words, no other nonlinearity can outperform soft thresholding 
in the limit of extreme undersampling - in the sense of minimax 
phase transitions. This is best understood using a notion from the 
main text. We there said that the parameter space (8, p, A, F) can 
be partitioned into two regions. Region (I) where there zero is the 
unique fixed point of the MSE map, and is a stable fixed point; 
and its complement, Region (II). Theorem IA.6I says that the range 
of p guaranteeing membership in Region (I) cannot be dramatically 
expanded by using a different nonlinearity. 

1) Some results on Minimax Risk: The proof depends on some 
know results about minimax MSE, where we are allowed to choose 
not just the threshold, but also the nonlinearity. For x £ { + i±}> 
define the minimax MSE 

M**(e; X ) = inf sup E F {lj{X + Z) - X) 2 } , (46) 

Here the minimization is over all measurable functions rj : R i— > R. 
Minimax MSE was discussed for the case x — + m 11251 and for 
X = ± in (27), QU, Q3]. It is known that 

M**(e; X )~21og( e - 1 ). t - 0. (47) 
H. Proof of Theorem \A.6\ 

Evidently, any specific nonlinearity cannot do better than the 
minimax risk: 

M*(e) > M**(e; X ). 
Consequently, if we put 

p**(S; X ) = sup{/9 : M**(5p; X ) < 5} 

then 

F(5,x)< P**(5,X)- 
From l |47t and the last two displays we conclude 

^ (5;x) - x), 

Theorem IA.6I is proven. □ 



/. Data Generation 

For a given algorithm with a fully specified parameter vector, we 
conduct one phase transition measurement experiment as follows. 
We fix a problem suite, i.e. a matrix ensemble and a coefficient 
distribution for generating problem instances (A,xo). We also fix 
a grid of 8 values in [0, 1], typically 30 values equispaced between 
0.02 and 0.99. Subordinate to this grid, we consider a series of p 
values. Two cases arise frequently: 

• Focused Search design. 20 values between pcc(<5; x) — 1/10 and 
pca(8; x) + 1/10, where p C a is the theoretically expected phase 
transition deriving from combinatorial geometry (according to 
case x £ {+,±,D}). 

• General Search design. 40 values equispaced between and 1. 
We then have a (possibly non-cartesian) grid of 8, p values in 
parameter space [0, l] 2 . At each (8,p) combination, we will take 
M problem instances; in our case M = 20. We also fix a measure 
of success; see below. 

Once we specify the problem size N, the experiment is now 
fully specified; we set n = \SN] and k — \pn\ , and generate 
M problem instances, and obtain M algorithm outputs x%, and M 
success indicators Si, i = 1, . . . M. 

A problem instance (y,A,xo) consists of n x N matrix A from 
the given matrix ensemble and a fc-sparse vector xq from the given 
coefficient ensemble. Then y — Axq. The algorithm is called with 
problem instance (y, A) and it produces a result x. We declare 
success if 

\\xo - x\\ 2 

—\\ — ii — - to1 ' 

||xo||a 

where tol is a given parameter; in our case 10~ 4 ; the variable Si 
indicates success on the i-th Monte Carlo realization. To summarize 
all M Monte Carlo repetitions, we set S — £\ Si. 

The result of such an experiment is a dataset with tuples 
(N,n,k, M, S); each tuple giving the results at one combination 
(p,S). The meta-information describing the experiment is the spec- 
ification of the algorithm with all its parameters, the problem suite, 
and the success measure with its tolerance. 

J. Estimating Phase Transitions 

From such a dataset we find the location of the phase transition 
as follows. Corresponding to each fixed value of 8 in our grid, 
we have a collection of tuples (N,n,k,M,S) with n/N = 8 
and varying k. Pretending that our random number generator makes 
truly independent random numbers, the result S at one experiment 
is binomial Bin(7r, M), where the success probability -n G [0,1]. 
Extensive prior experiments show that this probability varies from 1 
when p is well below p ca to when p is well above pea- In short, 
the success probability 

7T = ir(p\S; N). 

We define the finite-N phase transition as the value of p at which 
success probability is 50%: 

*(p\8;N) = ± at p = P {8). 

This notion is well-known in biometrics where the 50% point of the 
dose-response is called the LD50. (Actually we have the implicit 
dependence p(8) = p(5\N, tol); the tolerance in the success 
definition has a (usually slight) effect, as well as the problem size 

AT) 

To estimate the phase transition from data, we model depen- 
dence of success probability on p using generalized linear models 



(GLMs). We take a 5-constant slice of the dataset obtaining triples 
(k, M, S(k, n, N)), and model S(k, n, N) ~ Bin(7r fe ; M) where the 
success probabilities obeys a generalized linear model with logistic 
link 

logit(n) — a + bp 

where p = k/n; in biometric language, we are modeling that the 
dose-response probability, where p is the 'complexity-dose', follows 
a logistic curve. 

In terms of the fitted parameters a, 6, we have the estimated phase 
transition 

p(S) = -a/6, 
and the estimated transition width is 

w(5) = 1/6. 

Note that, actually, 

p(S) = p(8\N, tol), w(S) = w(5\N, tol) . 

We may be able to see the phase transition and its width varying 
with N and with the success tolerance. 

Because we make only M measurements in our Monte Carlo 
experiments, these results are subject to sampling fluctuations. Con- 
fidence statements can be made for p using standard statistical 
software. 

K. Tuning of Algorithms 

The procedure so far gives us, for each fully-specified combination 
of algorithm parameters A and each problem suite S, a dataset 
(A, S, S, p(5; A, S)). When an algorithm has such parameters, we 
can define, for each fixed S, the value of the parameters which gives 
the highest transition: 

p opt (5;S) = max/3(5; A, S); 

with associated optimal parameters A opt (S;S). When the results of 
the algorithm depend strongly on problem suite as well, we can also 
tune to optimize worst-case performance across suites, getting the 
minimax transition 

p m (S) — maxmin/5(<5; A, 5). 

and corresponding minimax parameters A MM (<5). This procedure was 
followed in (3) for a wide range of popular algorithms. Figure 3 of 
the main text presents the observed minimax transitions. 

L. Results: Empirical Phase Transition 

Figure [4] (which is a complete version of Figure 3 in the main text) 
compares observed phase transitions of several algorithms including 
AMP. We considered what was called in (3) the standard suite, wit 
these choices 

• Matrix ensemble: Uniform spherical ensemble(USE); each col- 
umn of A is drawn uniformly at random from the unit sphere 
in R™. 

• Coefficient ensemble: The vector xq has k nonzeros in random 
locations, with constant amplitude of nonzeros. If x = +■ 

x (i) e {o,+i} ; if x e {±,n}, Mi) e {+1,0,-1} (with 

equiprobable positive and negative entries). 

For each algorithm we generated an appropriate grid of (S, p) and 
created M — 20 independent problem instances at each gridpoint, 
i.e. independent realizations of vector x and measurement matrix A. 

For AMP we used a focused search design, focused around p CG {8). 
To reconstruct x, we run T = 1000 AMP iterations and report the 
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Fig. 4. Observed Phase Transitions for 6 Algorithms, and psE ■ AMP: method 
introduced in main text. 1ST: Iterative Soft Thresholding. IHT: Iterative Hard 
Thresholding. TST: a class of two-stage thresholding algorithms including 
subspace pursuit and CoSamp. OMP: Orthogonal Matching Pursuit. Note that 
the l\ curve coincides with the state evolution transition p S E, a theoretical 
calculation. The other curves show empirical results. 

mean square error at the final iteration. For other algorithms, we used 
the general search design as described above. For more details about 
observed phase transitions we refer the reader to (3J. 

The calculation of the phase transition curve of AMP takes around 
36 hours on a single Pentium 4 processor. 

Observed Phase transitions for other coefficient ensembles and 
matrix ensembles are discussed below in sections [O] and [P] 

M. Example of the Interference Heuristic 

In the main text, our motivation of the SE formalism used the 
assumption that the mutual access interference term MAI t = (A*A— 
I)(x — xo) is marginally nearly Gaussian - i.e. the distribution 
function of the entries in the MAI vector is approximately Gaussian. 

As we mentioned, this heuristic motivates the definition of the 
MSE map. It is easy to prove that the heuristic is valid at the first 
iteration; but for the validity of SE, it must continue to be true at 
every iteration until the algorithm stops. Figure [5] presents a typical 
example. In this example we have considered USE matrix ensemble 
and Rademacher Coefficient ensemble. Also N is set to a small size 
problem 2000 and (S,p) = (0.9,0.52). The algorithm is tracked 
across 90 iterations. Each panel exhibits a linear trend, indicating 
approximate Gaussianity. The slope is decreasing with iteration count. 
The slope is the square root of the MSE, and its decrease indicates 
that the MSE is evolving towards zero. More interestingly, figure [6] 
shows the QQplot of the MAI noise for the partial Fourier matrix 
ensemble. Coefficients here are again from Rademacher ensemble 
and (N,S,p) = (16384,0.5,0.35). 

N. Testing Predictions of State Evolution 

The last section gave an illustration tracking the actual evolution 
of the AMP algorithm, it showed that the State Evolution heuristic 
is qualitatively correct. 

We now consider predictions made by SE and their quantitative 
match with empirical observations. We consider predictions of four 
observables: 
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Fig. 5. QQ Plots tracking marginal distribution of mutual access interference 
(MAI). Panels (a)-(i): iterations 10, 20, ... , 90. Each panel shows QQ plot of 
MAI values versus normal distribution in blue, and in red (mostly obscured) 
points along a straight line. Approximate linearity indicates approximate nor- 
mality. Decreasing slope with increasing iteration number indicates decreasing 
standard deviation as iterations progress. 



Fig. 7. Comparison of State Evolution predictions against observations. 
p = .3, 6 = .15. Panels (a)-(d): MSENZ, MSE, MDR, FAR. Curve in red: 
theoretical prediction. Curve in blue: mean observable. Each panel shows 
the evolution of a specific observable as iterations progress. Two curves are 
present in each panel, however, except for the lower left panel, the blue curve 
(empirical data) is obscured by the presence of the red curve. The two curves 
are in close agreement in all panels. 
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Fig. 6. QQ Plots tracking marginal distribution of mutual access interference 
(MAI). Matrix Ensemble: partial Fourier. Panels (a)-(i): iterations 30,60,. . . , 
270. For other details, see Fig. \5\ 



Fig. 8. Comparison of State Evolution predictions against observations. 
p = 0.3, S = 0.15. For details, see Figure 



MSE on zeros and MSE on non-zeros: 

MSEZ = E[x(i) 2 \x (i) = 0], 

MSENZ = E[(£(i) - x (i)) 2 \x (i) / 0] 

Missed detection rate and False alarm rate: 



MDR : 
FAR : 



F[x(i) = 0\x o (i)^0], 
F[x(i) ^ 0\x (i) = 0] 



(48) 



(49) 



We illustrate the calculation of MDR. Other quantities are computed 
similarly. Let e = Sp, and suppose that entries in xo(i) are either 0, 



1, or -1, with ¥{x {i) = ±1} = e/2. Then, with Z ~ JV(0, 1), 
F[x(i) = 0|x o (i) + 0] = 



Pfo(l + -^Z)^0] 



P[Z?(a,b)] 



-Act, Act)] 



(50) 



with a = ((-A - 1/ct) • ■s/d, b = (A- 1/ct) ■ </$. 

In short, the calculation merely requires classical properties of the 
normal distribution. The three other quantities simply require other 
similar properties of the normal. As discussed in the main text, SE 
evolution makes an iteration-by-iteration prediction of at ; in order to 
calculate predictions of MDR, FAR, MSENZ and MSEZ, the parameters 
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Fig. 9. Comparison of State Evolution predictions against observations for 
p = 0.7, <5 = 0.36. For details, see Figure [JJ 



e and A are also needed. 

We compared the state evolution predictions with the actual values 
by a Monte Carlo experiment. We chose these triples (S,p,N): 
(0.3, 0.15, 5000), (0.5, 0.2, 4000), (0.7, 0.36, 3000). We again used 
the standard problem suite (USE matrix and unit amplitude nonzero). 
At each combination of (5, p, N), we generated M = 200 random 
problem instances from the standard problem suite, and ran the 
AMP algorithm for a fixed number of iterations. We computed the 
observables at each iteration. For example, the empirical missed 
detection rate is estimated by 



eMDR(t) = 



#{i : x*(i) = and x (i) / 0} 



#{i ■ Mi) + 0} 
We averaged the observable trajectories across the M Monte Carlo 
realizations, producing empirical averages. 

The results for the three cases are presented in Figures [7] [8] 
[9] Shown on the display are curves indicating both the theoretical 
prediction and the empirical averages. In the case of the upper row 
and the lower left panel, the two curves are so close that one cannot 
easily tell that two curves are, in fact, being displayed. 

O. Coefficient Universality 

SE displays invariance of the evolution results with respect to the 
coefficient distribution of the nonzeros. What happens in practice? 

We studied invariance of AMP results as we varied the distributions 
of the nonzeros in xq. We consider the problem x = ± and used the 
following distributions for the non-zero entries of xq: 

• Uniform in [— 1, +1]; 

• Radamacher (uniform in { + 1,-1}); 

• Gaussian; 

• Cauchy. 

In this study, N = 2000, and we considered 8 — 0.1, 0.3. For each 
value of 5 we considered 20 equispaced values of p in the interval 
[pco(<5; ±) - 1/10, poa(S; ±) + 1/10], running each time T = 1000 
AMP iterations. Data are presented, respectively, in Figures [Tol 

Each plot displays the fraction of success (S/M) as a function of p 
and a fitted success probability i.e. in terms of success probabilities, 
the curves display n(p). In each case 4 curves and 4 sets of data 



Fig. 10. Comparison of Failure probabilities for different ensembles. In the 
left window, <5 = 0.10 and in the right window 8 = 0.3. Red: unit-amplitude 
coefficients. Blue: uniform [— 1, 1]. Green: Gaussian. Black: Cauchy. Points: 
observed failure fractions Curves: Logistic fit. 



points are displayed, corresponding to the 4 ensembles. The four 
datasets are visually quite similar, and it is apparent that indeed a 
considerable degree of invariance is present. 

P. Matrix Universality 

The Discussion section in the main text referred to evidence that 
our results are not limited to the Gaussian distribution. 

We conducted a study of AMP where everything was the same 
as in Figure 1 above, however, the matrix ensemble could change. 
We considered three such ensembles: USE (columns iid uniformly 
distributed on the unit sphere), Rademacher (random entries iid ±1 
equiprobable), and Partial Fourier, (randomly select n rows from N x 
N fourier matrix.) We only considered the case \ — ±- Results are 
shown in Fig. QT| and compared to the theoretical phase transition 
for l x . 

Q. Timing Results 

In actual applications, AMP runs rapidly. 

We first describe a study comparing AMP to the LARS algorithm 
1281 . LARS is appropriate for comparison because, among the itera- 
tive algorithms previously proposed, its phase transition is closest to 
the i\ transition. So it comes closest to duplicating the AMP sparsity- 
undersampling tradeoff. 

Each algorithm proceeds iteratively and needs a stopping rule. In 
both cases, we stopped calculations when the relative fidelity measure 
exceeded 0.999, ie when \\y - Aa;*]| 2 /||y ||a < 0.001. 

In our study, we used the partial Fourier matrix ensemble with unit 
amplitude for nonzero entries in the signal xo. We considered a range 
of problem sizes (N, n, k) and in each case averaged timing results 
over M — 20 problem instances. Table U presents timing results. 

In all situations studied, AMP is substantially faster than LARS. 
There are a few very sparse situations - i.e. where k is in the tens or 
few hundreds - where LARS performs relatively well, losing the race 
by less than a factor 3. However, as the complexity of the objects 
increases, so that k is several hundred or even one thousand, LARS 
is beaten by factors of 10 or even more. 
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Fig. 11. Observed Phase Transitions at different matrix ensembles. Case 
\ = ±. Red: Uniform Spherical Ensemble (Gaussian with normalize column 
lengths). Magenta: Rademacher (±1 equiprobable). Green: partial Fourier. 
Blue: 



TABLE I 

Timing Comparison of AMP and LARS. Average Times in CPU 
seconds. 



N 


n 


k 


AMP 


LARS 


4096 


820 


120 


0.19 


0.7 


8192 


1640 


240 


0.34 


3.45 


16384 


3280 


480 


0.72 


19.45 


32768 


1640 


160 


2.41 


7.28 


16384 


820 


80 


1.32 


1.51 


8192 


820 


110 


0.61 


1.91 


16384 


1640 


220 


1.1 


5.5 


32768 


3280 


440 


2.31 


23.5 


4096 


1640 


270 


0.12 


1.22 


8192 


3280 


540 


0.22 


5.45 


16384 


6560 


1080 


0.45 


27.3 


32768 


1640 


220 


6.95 


17.53 



(For very large k, AMP has a decisive advantage. When the matrix 
A is dense, LARS requires at least ci-k-n-N operations, while AMP 
requires at most C2-n-N operations. Here C2 = log((EX 2 ) / a 2 ^) / b is 
a bound on the number of iterations, and (EX 2 )/ay is the relative 
improvement in MSE in T iterations. Hence in terms of flops we 
have 

flops(LARS) ^ kb(S, p) 
flops(AMP) " log((EX 2 )/cr 2 ) ' 

This logarithmic dependence of the denominator is very weak, and 
very roughly this ratio scales directly with k.) 

We also studied AMP's ability to solve very large problems. 

We conducted a series of trials with increasing A 7 in a case where A 
and A* can be applied rapidly, without using ordinary matrix storage 
and matrix operations; specifically, the partial Fourier ensemble. For 
nonzeros of the signal xo. we chose unit amplitude nonzeros. 

We considered the fixed choice (<5, p) = (1/6, 1/8) and N ranging 
from IK to (K = 1024) to 256^ in powers of 2. At each signal 
length N we generated M = 10 random problem instances and 
measured CPU times (on a single Pentium 4 processor) and iteration 
counts for AMP in each instance. We considered four stopping rules, 



Fig. 12. Iteration Counts versus Signal Length N . Different curves show 
results for different stopping rules. Horizontal axis: signal length N . Vertical 
axis: Number of iterations, T. Blue, Green, Red, Aqua curves depict results 
when stopping thresholds are set at 12 ■ 10~ 5 2 4 ~ f , with t = 0, 1, 2,3 Each 
doubling of accuracy costs about 5 iterations. 



based on MSE a 2 , a 2 /2, cr 2 /4, and a 2 /8, where a 2 = 12-10 -5 . We 
then averaged timing results over the M — 10 randomly generated 
problem instances 

Figure [12] presents the number of iterations as a function of the 
problem size and accuracy level. According to the SE formalism, this 
should be a constant independent of N at each fixed (S, p) and we 
see indeed that this is the case for AMP: the number of iterations is 
close to constant for all large N . Also according to the SE formalism, 
each additional iteration produces a proportional reduction in formal 
MSE, and indeed in practice each increment of 5 AMP iterations 
reduces the actual MSE by about half. 

Figure [T3] presents CPU time as a function of the problem size 
and accuracy level. Since we are using the partial Fourier ensemble, 
the cost of applying A and A* is proportional to Alog^A'); this 
is much less than what we would expect for the cost of applying 
a general dense matrix. We see that indeed AMP execution time 
scales very favorably with N in this case - to the eye, the timing 
seems practically linear with N. The timing results show that each 
doubling of N produces essentially a doubling of execution time, 
iteration produces a proportional reduction in formal MSE, and 
indeed in practice each increment of 5 AMP iterations reduces the 
MSE by about half. Each doubling of accuracy costs about 30% more 
computation time. 
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Fig. 13. CPU Time Scaling with N . Different curves show results for 
different stopping rules. Horizontal axis: signal length N. Vertical axis: CPU 
time(seconds). Blue, Green, Red, Aqua curves depict results when stopping 
thresholds are set at 12 ■ 10 _5 2 4 -*, with I = 0, 1, 2, 3 



